Rapid transition to new spoken dialogue domains: language model training using knowledge from previous domain applications and web text resources

Authors

  • Murat Akbacak
  • Yuqing Gao
  • Liang Gu
  • Hong-Kwang Jeff Kuo
Abstract

In generic automatic speech recognition (ASR) systems, language models (LMs) are typically trained to work across a broad range of input conditions. ASR systems used in domain-specific spoken dialogue systems (SDSs) are more constrained in content and style, and a mismatch in content and/or style between training and operating conditions degrades the performance of the dialogue application. This paper develops tools that facilitate rapid development of spoken dialogue applications in the context of language model training. It focuses on automatically collecting text that is useful for training accurate language models for a new target domain, without manually collecting any in-domain data. We investigate a framework for extracting useful information from previous domains and the World Wide Web (WWW). We collect data by submitting queries to a search engine and then clean the resulting text via syntactic and semantic filtering. This is followed by artificial sentence generation. Without using any in-domain data, our system achieved a word error rate (WER) of 19.33%, comparable to that of a language model trained on 32K manually collected in-domain sentences. Using less than 1% of the in-domain data along with the automatically generated text, our system achieved ASR performance close to that of a language model trained on 60K in-domain sentences.
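The word error rate quoted in the abstract is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name `word_error_rate` and the example sentences are illustrative assumptions, and the abstract does not say which scoring tool or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only; not the scoring tool used in the paper.
    Tokenization is plain whitespace splitting (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one deleted word ("a") and one substituted word.
example_wer = word_error_rate("book a flight to boston", "book flight to austin")
```

Under this definition, a WER of 19.33% corresponds to roughly one word-level error for every five reference words.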

Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Rapid Development Process of Spoken Dialogue Systems using Collaboratively Constructed Semantic Resources

We herein propose a method for the rapid development of a spoken dialogue system based on collaboratively constructed semantic resources and compare the proposed method with a conventional method that is based on a relational database. Previous development frameworks of spoken dialogue systems, which presuppose a relational database management system as a background application, require complex...

A bootstrapping approach for developing language model of new spoken dialogue systems by selecting web texts

This paper proposes a bootstrapping method for constructing statistical language models for new spoken dialogue systems by collecting and selecting sentences from the World Wide Web (WWW). To make effective search queries that cover the target domain in full detail, we exploit a set of documents describing the target domain as seed data. An important issue is how to filter the retrieved We...

Spoken language understanding using layered n-gram modeling

This paper presents an approach which integrates layer concept information into the trigram language model in order to improve the understanding accuracy for spoken dialogue systems and to improve the portability of the language modeling materials among different narrow-domain applications. With this approach, both the recognition accuracy and out-of-grammar problem can be largely improved, and...

On-Line Learning of a Persian Spoken Dialogue System Using Real Training Data

The first spoken dialogue system developed for the Persian language is introduced. This is a ticket reservation system with Persian ASR and NLU modules. The focus of the paper is on learning the dialogue management module. In this work, real on-line training data are used during the learning process. For on-line learning, the effect of the variations of discount factor (g) on the learning speed...

Publication date: 2005